unchecked_emplace_back, unchecked_emplace and self_assign

Andersama AUG 19, 2021

Authors: Alexander Anderson

Abstract

This paper proposes adding unchecked_emplace_back, unchecked_emplace and self_assign to vector-like containers: non-allocating member functions that let users who have already reserved sufficient capacity append elements and adjust the container's size without redundant capacity checks, element copies, or construction and destruction calls.

1. Revision History

  1. Initial Draft

unchecked_emplace_back and unchecked_emplace are variations of two functions which already exist in the standard for several containers. The key difference is that these unchecked_ variants do not and will not allocate; they assume the capacity of the already allocated storage is at least one greater than the current size. For the purposes of this paper, consider the signatures of unchecked_emplace_back and unchecked_emplace to match those of their "checked" counterparts. The most relevant container is vector, but consider this a proposal to add these functions to all containers which behave similarly to vector. Assume the worst of these functions in terms of safety: they can and will corrupt data, reach and dereference memory out of bounds, and are the essence of running with scissors.

//...imagine somewhere buried in MSVC's vector standard header
template <class... _Valty>
_CONSTEXPR20_CONTAINER decltype(auto) unchecked_emplace_back(_Valty &&..._Val) {
  return _Emplace_back_with_unused_capacity(_STD forward<_Valty>(_Val)...);
}

self_assign is a new function similar to assign; like the above unchecked_ functions, self_assign will not allocate. It still "replaces" the container's contents, however self_assign is much more restricted in its usage.

//...imagine somewhere buried in MSVC's vector standard header
template <class _Iter>
_CONSTEXPR20_CONTAINER void _Self_Assign(_Iter _Last) {
  auto &   _My_data = _Mypair._Myval2;
  pointer &_Mylast  = _My_data._Mylast;
  pointer &_Myend   = _My_data._Myend;

  _Adl_verify_range(_Last, _Myend);
  _Mylast = _Last;
}

template <class _Iter, enable_if_t<_Is_iterator_v<_Iter>, int> = 0>
_CONSTEXPR20_CONTAINER void self_assign(_Iter _Last) {
  _Self_Assign(_Last);
}
//...perhaps written a bit more human-like
//...assuming our container's implementation of vector stores three pointers _begin, _end, and _capacity
//where size() returns _end - _begin, capacity() returns _capacity - _begin, and data() and begin() return _begin
template <class Iterator, enable_if_t<_Is_iterator_v<Iterator>, int> = 0>
constexpr void self_assign(Iterator last) {
    assert(last >= begin() && last <= (begin() + capacity()));
    _end = last;
}

This example hopefully gives away the rough purpose and usage of self_assign: it is a single parameter function which takes an iterator to the new end of the array. It assumes the memory is already managed and performs no lifetime management of the data. If self_assign is used to expand the container, it is the caller's responsibility to have constructed the new elements; if it is used to shrink the container, the caller is expected to have destroyed elements as needed. It also expects the iterator to be within the existing bounds between begin() and begin()+capacity(), which may be (and should be) asserted. It is also a good idea to limit the function so it is only enabled if the input type is in fact an iterator.
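To make that lifetime contract concrete, here is a minimal sketch, assuming the proposed self_assign member exists on std::vector, a C++20 toolchain, and the default allocator (allocator-aware construction is ignored for brevity):

#include <memory>
#include <string>
#include <vector>

void grow_then_shrink(std::vector<std::string>& v) {
  v.reserve(v.size() + 2); // make sure there is spare capacity to expand into

  // Expanding: construct the new elements in the spare capacity first...
  std::string* old_end = v.data() + v.size();
  std::construct_at(old_end + 0, "hello");
  std::construct_at(old_end + 1, "world");
  // ...then move the container's end past them (the member proposed above).
  v.self_assign(old_end + 2);

  // Shrinking: destroy the element being dropped first...
  std::destroy_at(v.data() + (v.size() - 1));
  // ...then move the container's end back over it.
  v.self_assign(v.data() + (v.size() - 1));
}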

2. Motivation

unchecked_emplace_back and unchecked_emplace already exist to a certain degree in most standard library implementations. The goal here is more or less to expose to the user the internal utilities already written with performance in mind, and to prevent unnecessary workarounds to reach the same performance the backend already has available. Consider the snippet below:

std::vector<char> a_pseudo_string;
a_pseudo_string.reserve(32);
for (size_t i = 0; i < 32; i++) {
  a_pseudo_string.emplace_back(i+32);
}

While std::vector<char> is not std::string and this small snippet is not slow by any stretch of the imagination, parts of it may be relying on compiler optimizations to kick in. If we're lucky, the compiler might notice what we see intuitively in this example: that emplace_back does not need to check the capacity of the vector. Even better, and more likely given how many constants appear here, the compiler may remove the allocation entirely. The key point, however, is that we have written the loop so that the container already has room for the data we need, and we don't need to check our container's capacity().

std::vector<char> a_pseudo_string;
a_pseudo_string.reserve(32);
for (size_t i = 0; i < 32; i++) {
  //does what it says on the tin
  a_pseudo_string.unchecked_emplace_back(i+32);
}

We can also roughly imagine similar benefits for calls to emplace, which is the backbone of insert.
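As a rough sketch of what a call site might look like (unchecked_emplace here is the hypothetical counterpart to emplace, with a matching signature):

std::vector<int> values = {1, 2, 4, 5};
values.reserve(values.size() + 1); // guarantee spare capacity up front

// No capacity check and no reallocation path; elements after the insertion
// point are still shifted over, exactly as emplace would do.
values.unchecked_emplace(values.begin() + 2, 3); // values: 1 2 3 4 5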

2.1 Current state of affairs

Here is where things become painful. std::vector is pretty well guarded in that, while we may absolutely abuse the access data() gives us to manipulate any number of indices, there is no nice way of controlling size(). This makes for an awkwardly lopsided API for anyone wanting or needing to eke out the last bit of performance by reaching for lower level access. There are no public methods with which we can modify the internals of the container while avoiding unnecessary bounds checking, unnecessary copies, and unnecessary construction or destruction calls. If we need to change size() in order to keep the vector valid after our manipulations, we must pay a higher level cost, which may defeat the purpose.
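As an illustration of that higher level cost, the closest public knob today is arguably resize(), which value-initializes every element it appends even when we intend to overwrite them all immediately; a rough sketch:

std::vector<uint32_t> buffer;
buffer.resize(32); // pays to zero-initialize 32 elements we are about to overwrite anyway
for (size_t i = 0; i < 32; i++) {
  buffer[i] = static_cast<uint32_t>(i + 32);
}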

Obviously this is no accident; a well designed API for a class should, after all, be constrained so that misuse and errors can be handled well. However, if we were aiming for safety through and through, we likely shouldn't have operator[] nor data() while we have at(). This paper ideally lends a hand to all those who work at the lower level, so that we don't have to write our own std::vector or similar just to scrape out the few percentage points that are clearly available, and to those who are still very much in love with C and are not afraid of pointer dereferences or raw array access.

2.2 The assign approach

Let's consider what it currently looks like if we want to do something similar to unchecked_emplace_back.

std::vector<char> a_pseudo_string;
a_pseudo_string.reserve(32);

char *danger_ptr = a_pseudo_string.data();
for (size_t i = 0; i < 32; i++) {
  danger_ptr[i] = i+32;
}
// an incredibly awkward use of assign
a_pseudo_string.assign(danger_ptr, danger_ptr+32);

Depending on the data types we're using, assign can be extremely penalizing, as it can run a full gamut of additional work. Here, with a bit of trivial data, it's not so bad, but for anything complex we will end up paying. assign is designed to assign data as quickly as humanly possible, and an implementation might even make self-assignment a specialized case that does as little work as possible; however, much like with the unchecked_ variants above, we can avoid that branch entirely because we know a bit more about the problem.

When I was originally motivated to modify <vector>, bizarrely it was because I had written something like the above, but hadn't considered anything about assign. I was immediately drawn to the idea of unchecked_emplace_back because I knew it already existed and wouldn't take much work, and my code was absolutely strewn with emplace_back. However, as shown above, self_assign also did not take much work. As a bonus it also has a nice looking pattern:

std::vector<some_type> buffer;
buffer.reserve(maximum_width);
size_t count = 0;
for (size_t i = 0; i < maximum_width; i++) {
  //etc...
}
buffer.self_assign(buffer.data()+count);

Here, reserve() and self_assign() form their own neatly enclosed block. It can be even better: in a real example, even if we're appending conditionally, we can still take advantage of a branchless loop:

std::vector<size_t> buffer;
buffer.reserve(maximum_width);
size_t count = 0;
//a loop to store odd numbers
for (size_t i = 0; i < maximum_width; i++) {
  //no ifs required, no branches
  bool is_odd = i % 2;
  buffer.data()[count] = i;
  count += is_odd;
}
buffer.self_assign(buffer.data()+count);

Here we get something out of self_assign that we currently can't get out of emplace_back or similar; of course, the benefit exists solely because we've already saved ourselves a branch and avoided any allocation. If we have to pay for a branch anyway because we have an unknown capacity() requirement, we should be using emplace_back.

3. Considerations

All of these functions likely violate some degree of the safety the API has been striving to achieve, namely that they don't just "work". There are specific conditions on their use, and they operate in tandem with a well placed reserve(). As someone explained (and I'll need help to cite who, because I can't remember), there's an additional difficulty in managing pairs of things. However, C and C++ are littered with examples: malloc and free, new and delete, and I'm sure a number of others which I also can't remember from the talk. This makes them error prone and a source of many existing security concerns. In very unusual applications the management of these pairs is almost given up entirely, for example in compilers like DMD, where the program is simply not meant to be executing for long.

That said, these functions are much like malloc and free, or new and delete, in that while they may be "new" to the publicly facing API, unchecked_emplace_back and unchecked_emplace are very much not new, especially not to library implementers, nor to anyone who has peeked at a <vector> header. These utility functions have been lying in the background optimizing edge cases in much the way we might expect.

Of the functions presented, self_assign is very much an explicit deviation from the safety oriented API, namely in that it obfuscates its ability to modify size(), although I very much think a modify_size() would be far more egregious. I would fully expect self_assign to come with assertions for anyone safety conscious. Unlike the unchecked_ variations, it does in fact provide a new form of functionality, namely that it allows for more complicated (and performant) appends. I don't imagine it would be long before someone uses it for more complicated transformations as well.
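As a hedged sketch of one such transformation, consider in-place filtering of an existing vector, where the number of surviving elements is only known after the pass (self_assign is the member proposed above; the element type is trivially destructible, so the dropped tail needs no destruction):

std::vector<uint32_t> values = {1, 2, 3, 4, 5, 6, 7, 8};
size_t kept = 0;
for (size_t i = 0; i < values.size(); i++) {
  uint32_t v = values[i];
  values.data()[kept] = v;        // always write forward, no branch
  kept += (v % 2 == 0);           // advance the write cursor only for even values
}
values.self_assign(values.data() + kept); // values now holds {2, 4, 6, 8}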

4. Real World

It wouldn't be too convincing without numbers, so here are a few quick runs of a benchmark based on the earlier examples:

| MSVC x64 debug
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|               91.60 |       10,917,030.57 |    0.5% |      0.00 | `emplace_back`
|               31.82 |       31,421,838.18 |    0.4% |      0.00 | `unchecked_emplace_back`
|                9.04 |      110,663,983.90 |    3.4% |      0.00 | `assign`
|                8.65 |      115,606,936.42 |    0.1% |      0.00 | `self_assign`

| MSVC x64 release
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                0.97 |    1,026,392,961.88 |    0.1% |      0.00 | `emplace_back`
|                0.89 |    1,126,637,554.59 |    0.1% |      0.00 | `unchecked_emplace_back`
|                0.57 |    1,741,245,136.19 |    0.1% |      0.00 | `assign`
|                0.53 |    1,899,335,232.67 |    0.0% |      0.00 | `self_assign`

| Clang x64 debug
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|               21.06 |       47,483,380.82 |    0.1% |      0.00 | `emplace_back`
|               20.40 |       49,019,607.84 |    0.1% |      0.00 | `unchecked_emplace_back`
|                2.78 |      359,640,359.64 |    0.1% |      0.00 | `assign`
|                2.71 |      369,290,573.37 |    0.0% |      0.00 | `self_assign`

| Clang x64 release
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                1.77 |      564,440,263.41 |    0.1% |      0.00 | `emplace_back`
|                0.80 |    1,256,544,502.62 |    0.2% |      0.00 | `unchecked_emplace_back`
|                0.55 |    1,803,127,874.89 |    0.0% |      0.00 | `assign`
|                0.53 |    1,873,767,258.38 |    0.0% |      0.00 | `self_assign`

And here is the code, in the event I've butchered the benchmark. One op should roughly translate to the cost of writing one unsigned int. Strangely enough, and a bit of a teaching moment, the assign-style pattern (writing through data() and fixing up the end afterwards) is the way to go, in debug or release.

#include "vector.h"

#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"
#undef ANKERL_NANOBENCH_IMPLEMENT

int main()
{
  ankerl::nanobench::Bench bench;
  
  size_t count = 0;
  //to force the compiler to avoid optimizing anything, this should be randomized at runtime
  size_t batch = 1000;
  
  bench.batch(batch);

  //increase above 100 ms for more stable results, the performance benefits here are small enough
  //that cold runs vary a bit
  //bench.minEpochTime(std::chrono::seconds{ 1 });
  
  std::vector<uint32_t> buffer;
  buffer.reserve(batch);
  
  bench.run("emplace_back", [&buffer, &count, &batch]() {
    buffer.clear();
    for (size_t i = 0; i < batch; i++) {
      if (i % 2) {
        buffer.emplace_back(i);
      }
    }
    count += buffer.size();
  });
  
  
  bench.run("unchecked_emplace_back", [&buffer, &count, &batch]() {
    buffer.clear();
    for (size_t i = 0; i < batch; i++) {
      if (i % 2) {
        buffer.unchecked_emplace_back(i);
      }
    }
    count += buffer.size();
  });
  
  bench.run("assign", [&buffer, &count, &batch]() {
    buffer.clear();
    uint32_t* ptr = buffer.data();
    size_t current = 0;
    for (size_t i = 0; i < batch; i++) {
      ptr[current] = i;
      current += (i % 2);
    }
    buffer.assign(buffer.data(), buffer.data() + current);
    count += buffer.size();
  });
  
  bench.run("self_assign", [&buffer, &count, &batch]() {
    buffer.clear();
    uint32_t* ptr = buffer.data();
    size_t current = 0;
    for (size_t i = 0; i < batch; i++) {
      ptr[current] = i;
      current += (i % 2);
    }
    buffer.self_assign(buffer.data() + current);
    count += buffer.size();
  });
  
  return 0;
}

Here are some results when we run the same benchmark against 1,000,000 ints:

| MSVC x64 release
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                0.96 |    1,045,854,271.36 |    0.1% |      0.01 | `emplace_back`
|                0.80 |    1,251,252,505.01 |    0.1% |      0.01 | `unchecked_emplace_back`
|                0.65 |    1,528,457,772.34 |    0.1% |      0.01 | `assign`
|                0.47 |    2,149,774,047.77 |    0.4% |      0.01 | `self_assign`

| Clang x64 release
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                1.58 |      631,572,279.23 |    0.1% |      0.02 | `emplace_back`
|                0.80 |    1,255,035,246.73 |    0.1% |      0.01 | `unchecked_emplace_back`
|                0.70 |    1,430,826,636.05 |    0.6% |      0.01 | `assign`
|                0.53 |    1,883,619,875.31 |    0.3% |      0.01 | `self_assign`

Here the results are a bit more pronounced. Here are some percentage differences computed from the earlier Clang tests:

| Clang x64 release (1000 ints)
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                0.55 |    1,803,127,874.89 |    0.0% |      0.00 | `assign`
|                0.53 |    1,873,767,258.38 |    0.0% |      0.00 | `self_assign`
|--------------------:|--------------------:|--------:|----------:|:----------
| 1,873,767,258.38 / 1,803,127,874.89 = 1.03917 | 3.917% relative speedup

| Clang x64 release (1000000 ints)
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                0.70 |    1,430,826,636.05 |    0.6% |      0.01 | `assign`
|                0.53 |    1,883,619,875.31 |    0.3% |      0.01 | `self_assign`
|--------------------:|--------------------:|--------:|----------:|:----------
| 1,883,619,875.31 / 1,430,826,636.05 = 1.31645 | 31.645% relative speedup

| Clang x64 release (1000 ints)
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                1.77 |      564,440,263.41 |    0.1% |      0.00 | `emplace_back`
|                0.80 |    1,256,544,502.62 |    0.2% |      0.00 | `unchecked_emplace_back`
|--------------------:|--------------------:|--------:|----------:|:----------
| 1,256,544,502.62 / 564,440,263.41 = 2.22617 | 122.617% relative speedup

| Clang x64 release (1000000 ints)
|               ns/op |                op/s |    err% |     total | emplace_back benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                1.58 |      631,572,279.23 |    0.1% |      0.02 | `emplace_back`
|                0.80 |    1,255,035,246.73 |    0.1% |      0.01 | `unchecked_emplace_back`
|--------------------:|--------------------:|--------:|----------:|:----------
| 1,255,035,246.73 / 631,572,279.23 = 1.98716 | 98.716% relative speedup

Note that this test, when using emplace_back, is awful on the branch predictor, but hopefully it illustrates the wide range of performance impact these functions may have.